Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCM/UCS: Fail to create memtype cache if cannot patch Cuda driver API #7865

Merged

Conversation

yosefe
Copy link
Contributor

@yosefe yosefe commented Jan 22, 2022

Why

Many Cuda applications use static link, so it's not safe to assume that relocation-based memory hooks on Cuda runtime API can be enough. To be on the safe safe, fail to create the registration cache and (by default) fallback to pointer-query based memory detection.
Note: As a side effect, this will also disable RDMA registration cache when could not install Cuda hooks, since not knowing about a Cuda memory release can lead to stale memory keys in the cache and a data corruption.

How

  • Use Cuda Bistro hooks by default
  • UCX_MEMTYPE_CACHE can by yes/no/try (default: try). If set to yes and it could not be created - print a warning.
  • Memtype cache fails to create if could not set memtype hooks

The logic is:

  • UCX_MEM_CUDA_HOOK_MODE=bistro (default):
    • If bistro hooks on driver API fail - registration cache and memory type cache are not created.
      If also UCX_MEMTYPE_CACHE=y (default is try) - a warning is printed.
  • UCX_MEM_CUDA_HOOK_MODE=reloc:
    This method sets reloc hooks on both driver and runtime APIs.
    If either driver or runtime API reloc hooks fail (it's not expected) - memtype cache and rcache are disabled.
    This method is safe only if the application is dynamically linked to cuda runtime API (or using the driver API directly), so it's not the default.

Debugging

Set "UCX_MEM_LOG_LEVEL=diag" to get more info if failed to install memory hooks.

Related to #7791 (comment)
cc @Akshay-Venkatesh @pentschev

@yosefe yosefe force-pushed the topic/ucm-ucs-fail-to-create-memtype-cache branch from 7c67e7b to 537eb4b Compare January 22, 2022 21:42
@yosefe yosefe added the Bugfix label Jan 22, 2022
@yosefe yosefe force-pushed the topic/ucm-ucs-fail-to-create-memtype-cache branch from 537eb4b to 8e1b3be Compare January 23, 2022 10:29
@shamisp
Copy link
Contributor

shamisp commented Jan 23, 2022

@yosefe is this only for static or this will impact dynamic linked cuda applications as well ?

@yosefe
Copy link
Contributor Author

yosefe commented Jan 23, 2022

@yosefe is this only for static or this will impact dynamic linked cuda applications as well ?

it will affect both, to use only driver-api hooks instead of runtime/driver.

@shamisp
Copy link
Contributor

shamisp commented Jan 23, 2022

@yosefe in the past you have been avoiding this because of high cost (system call). Does it mean that performance will go down ?

@yosefe
Copy link
Contributor Author

yosefe commented Jan 23, 2022

@yosefe in the past you have been avoiding this because of high cost (system call). Does it mean that performance will go down ?

Do you mean memory type detection?
We still use memtype cache for detection (if possible); it's just that instead of hooking both driver and runtime APIs for it, we actually need to hook only the driver API.
The latest observation is that we really want to ensure the driver API hooks are installed [since any app can be static link potentially], so there is no point of hooking the upper-layer runtime APIs.

@shamisp
Copy link
Contributor

shamisp commented Jan 23, 2022

@yosefe got it. thanks

@yosefe
Copy link
Contributor Author

yosefe commented Jan 26, 2022

@Akshay-Venkatesh @bureddy can you pls take a look?

@yosefe
Copy link
Contributor Author

yosefe commented Jan 27, 2022

per offline discussion with @bureddy and @Akshay-Venkatesh : will keep reloc hooks for Cuda runtime API as a disabled option, for applications that use dynamic link and fail to patch driver API for some reason.

@yosefe
Copy link
Contributor Author

yosefe commented Feb 3, 2022

@Akshay-Venkatesh @bureddy can you pls take a look?

@Akshay-Venkatesh
Copy link
Contributor

@Akshay-Venkatesh @bureddy can you pls take a look?

Sorry for the delay @yosefe Taking a look now.

Copy link
Contributor

@Akshay-Venkatesh Akshay-Venkatesh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yosefe Probably missed this is in the diff but
1. Can you point to how driver hooks installation failure prevents memtype_cache from being created?

  1. Is it the case that memtype_cache creation calls ucm_cudamem_install (which doesn't return UCS_OK if driver hooks failed) and this leads to disabling memtype cache when bistro and reloc methods fail on driver functions?
    2. Is it the case that memtype_cache being absent prevents rcache instance from being created and avoids potential data corruption?

goto out_unlock;
}

status = ucm_cuda_install_hooks(ucm_cuda_runtime_funcs, "runtime",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need to attempt installation of runtime hooks here? If we're not going to create memtype cache if driver hooks failed, can we not skip this step?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if driver bistro hooks failed, we do skip the other steps:

    status = ucm_cuda_install_hooks(ucm_cuda_driver_funcs, "driver",
                                    UCM_MMAP_HOOK_BISTRO, &driver_api_hooks);
    if (status != UCS_OK) {
        **goto out_unlock;**
    }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean can we not remove lines 282-285?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these lines are needed in case the user would like to set UCX_MEM_CUDA_HOOK_MODE=reloc manually

ucs_memtype_cache_failed = 1;
if (ucs_global_opts.enable_memtype_cache == UCS_YES) {
ucs_warn("failed to create memtype cache: %s",
ucs_status_string(status));
Copy link
Contributor

@bureddy bureddy Feb 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be a hard error if the user is forced to enable memtype cache?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think it should be a fatal error since we can still continue the program without memtype cache

@Akshay-Venkatesh
Copy link
Contributor

@Akshay-Venkatesh @bureddy can you pls take a look?

@yosefe If you don't mind, can you include in PR description, the current conditions under which

  1. bistro hooks fail with driver API
  2. reloc hooks fail with driver API

@yosefe
Copy link
Contributor Author

yosefe commented Feb 3, 2022

@yosefe If you don't mind, can you include in PR description, the current conditions under which

  1. bistro hooks fail with driver API
  2. reloc hooks fail with driver API

@Akshay-Venkatesh I've updated PR desc. Hope it answers the previous questions as well.

- Reloc hooks are optional but are not the default. Default is bistro
  hooks on Cuda driver API.
- If UCM fails to install the configured hooks, do not create the memory
  type cache.
- If failed to create memtype cache once, don't try again.
- Use getauxv() API if possible instead of reading /proc/self/auxv
  directly - fixes permissions errors on some systems.
- Enable Cuda bistro hooks also with valgrind, since it doesn't affect
  heap memory allocations.
- Fix error message in tests.
@yosefe yosefe force-pushed the topic/ucm-ucs-fail-to-create-memtype-cache branch from 2fd5f72 to 119a81b Compare February 3, 2022 20:35
@yosefe yosefe merged commit 5a57268 into openucx:master Feb 5, 2022
@yosefe yosefe deleted the topic/ucm-ucs-fail-to-create-memtype-cache branch February 5, 2022 13:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants